Topic distribution

Using the bag of words generated from the corpus, and the topic models generated, I now use them to project publications in the LDA topic space, as well as for the individuals in the School

Setup

In [14]:
import pandas as pd
import numpy as np
import cPickle as pkl
In [20]:
import matplotlib.pyplot as plt
%matplotlib inline
rng = np.random.RandomState(1234567)
In [2]:
from gensim import models
from gensim.corpora import Dictionary
In [3]:
lookup_pub = pkl.load(open('../infnet-analysis/data/lookup_pub.pkl', 'rb'))
lookup_poinf = pkl.load(open('../infnet-analysis/data/lookup_poinf.pkl','rb'))
pub_toks = pkl.load(open('../infnet-scrapper/data/pub_toks.pkl','rb'))
In [5]:
from hdbscan import HDBSCAN ## TRYING WITH HDBSCAN:  http://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html

Topic Distribution by Publications

In [6]:
# Load Dictionary to convert words to id:
dictionary = pkl.load(open('../topicModel/dictionary.pkl','rb'))
In [7]:
# Convert tokens to bow:
bowified = lambda row: dictionary.doc2bow(row.summary_toks)
pub_toks['bow'] = pub_toks.apply(bowified, axis=1)
In [9]:
pub_toks.head(4)
Out[9]:
year summary_toks bow
pub_id
400818dc-63af-4a26-80c5-906f98e1f8ab 1989 [balloon, stabil, analysi, jet, hmode, dischar... [(40, 2), (156, 1), (223, 2), (293, 2), (328, ...
18b1a861-afef-4fff-bc80-d02e05be18c4 2013 [queri, process, data, integr, chapter, illust... [(40, 1), (229, 1), (253, 1), (319, 2), (330, ...
309fdbfc-227b-4588-9264-f0f4e3cadfcb 1994 [comprehens, syntax, syntax, comprehens, close... [(90, 3), (123, 1), (186, 1), (189, 5), (196, ...
d5814bab-5fc2-4c31-92b7-543c7ce75cb4 2012 [evalu, speaker, verif, secur, detect, hmmbase... [(18, 1), (28, 1), (29, 2), (30, 1), (63, 1), ...
In [8]:
# load the LDA models:
fullpubLDA = models.LdaModel.load('fullpub.ldamodel')
In [9]:
def inference(ldaModel, ldaVector):
    num_topics = ldaModel.num_topics
    topic_dist = ldaModel[ldaVector]
    
    # index the topic_distribution according to the distribution:
    out = [0]*num_topics
    for (i,v) in topic_dist:
        out[i] = v
    assert len(out) == num_topics
    return out
In [10]:
_inference = lambda row: inference(fullpubLDA, row.bow)
pub_toks['topic_distribution'] = pub_toks.apply(_inference, axis=1)
In [15]:
pub_toks.head(4)
Out[15]:
year summary_toks bow topic_distribution
pub_id
400818dc-63af-4a26-80c5-906f98e1f8ab 1989 [balloon, stabil, analysi, jet, hmode, dischar... [(40, 2), (156, 1), (223, 2), (293, 2), (328, ... [0, 0.0982922003593, 0, 0.0248418858752, 0.422...
18b1a861-afef-4fff-bc80-d02e05be18c4 2013 [queri, process, data, integr, chapter, illust... [(40, 1), (229, 1), (253, 1), (319, 2), (330, ... [0, 0, 0, 0, 0, 0.153429075164, 0, 0, 0.341365...
309fdbfc-227b-4588-9264-f0f4e3cadfcb 1994 [comprehens, syntax, syntax, comprehens, close... [(90, 3), (123, 1), (186, 1), (189, 5), (196, ... [0, 0.298135265913, 0, 0, 0, 0, 0, 0, 0, 0.181...
d5814bab-5fc2-4c31-92b7-543c7ce75cb4 2012 [evalu, speaker, verif, secur, detect, hmmbase... [(18, 1), (28, 1), (29, 2), (30, 1), (63, 1), ... [0, 0, 0, 0, 0.132472201269, 0, 0, 0, 0.122855...
In [11]:
def best_topic(topic_dist):
    """
    Assign the pulbication with the topic that best describes it;
    this is equivalent to the index that have the highest topic_distribution
    """
    a = np.argmax(topic_dist)
    assert a < 20
    assert a >= 0
    return a
In [15]:
pub_toks['best_topic'] = pub_toks.apply(lambda row: best_topic(row.topic_distribution), axis=1)
In [92]:
pub_toks.head(4)
Out[92]:
year summary_toks bow topic_distribution best_topic
pub_id
400818dc-63af-4a26-80c5-906f98e1f8ab 1989 [balloon, stabil, analysi, jet, hmode, dischar... [(40, 2), (156, 1), (223, 2), (293, 2), (328, ... [0, 0.0982922003593, 0, 0.0248418858752, 0.422... 4
18b1a861-afef-4fff-bc80-d02e05be18c4 2013 [queri, process, data, integr, chapter, illust... [(40, 1), (229, 1), (253, 1), (319, 2), (330, ... [0, 0, 0, 0, 0, 0.153429075164, 0, 0, 0.341365... 18
309fdbfc-227b-4588-9264-f0f4e3cadfcb 1994 [comprehens, syntax, syntax, comprehens, close... [(90, 3), (123, 1), (186, 1), (189, 5), (196, ... [0, 0.298135265913, 0, 0, 0, 0, 0, 0, 0, 0.181... 18
d5814bab-5fc2-4c31-92b7-543c7ce75cb4 2012 [evalu, speaker, verif, secur, detect, hmmbase... [(18, 1), (28, 1), (29, 2), (30, 1), (63, 1), ... [0, 0, 0, 0, 0.132472201269, 0, 0, 0, 0.122855... 13

Clustering

In [16]:
## Our dataset will be the topic_distribution:
data = pub_toks.topic_distribution.values

Visualisation

We can visualise the data in 2D, and color each publication based on the most salient topic it is on:

In [ ]:
from sklearn import manifold

TSNE

Here, use the manifold package from sklearn to reduce the dimensionality of the data for visualisation. For coloring, use the topic that gives the highest probability

In [27]:
x_components = manifold.TSNE(n_components=2, init='pca', random_state=rng).fit_transform(data)
In [103]:
f = plt.figure(figsize=(10,10))
ax = f.add_subplot(111)
ax.scatter(x_components[:,0], x_components[:,1], c=list(pub_toks.best_topic), cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
plt.show()

MDS

In [218]:
# Multidimensional Scaling
mds_components = manifold.MDS(n_components=2, random_state=rng).fit_transform(data)
In [219]:
f = plt.figure(figsize=(10,10))
ax = f.add_subplot(111)
ax.scatter(mds_components[:,0], mds_components[:,1], c=list(pub_toks.best_topic), cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
plt.show()

While the colors of the publications are based on the topic the publication is salient on (from the LDA), we make no assumption that these are the clusters that are being used.

Hence, now we use some clustering algorithm to cluster our data such that we can color them accordingly. In our clustering algorithms, we have 20 clusters as well, similar to the topics.

KMeans

In [25]:
from sklearn.cluster import KMeans

n_cluster = 20

In [26]:
kmeansClustering = KMeans(n_clusters=20).fit_predict(data)
In [28]:
f = plt.figure(figsize=(20,10))
ax = f.add_subplot(121)
ax.scatter(x_components[:,0], x_components[:,1], c=kmeansClustering, cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('clustering based on kMeans (using TSNE)')
ax2 = f.add_subplot(122)
ax2.scatter(x_components[:,0], x_components[:,1], c=list(pub_toks.best_topic), cmap=plt.cm.jet, s=8)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')
plt.show()
In [ ]:
f = plt.figure(figsize=(20,10))
ax = f.add_subplot(121)
ax.scatter(mds_components[:,0], mds_components[:,1], c=kmeansClustering, cmap=plt.cm.jet, s=20)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('Clustering baed on kMeans (mds)')

ax2 = f.add_subplot(122)
ax2.scatter(mds_components[:,0], mds_components[:,1], c=list(pub_toks.best_topic), cmap=plt.cm.jet, s=20)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('Clustering based on best_topic (mds)')
plt.show()
plt.show()

n_cluster = 30

In [29]:
kmeansClustering30 = KMeans(n_clusters=30).fit_predict(data)
In [30]:
f = plt.figure(figsize=(20,10))
ax = f.add_subplot(121)
ax.scatter(x_components[:,0], x_components[:,1], c=kmeansClustering30, cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('clustering based on kMeans (using TSNE)')
ax2 = f.add_subplot(122)
ax2.scatter(x_components[:,0], x_components[:,1], c=list(pub_toks.best_topic), cmap=plt.cm.jet, s=8)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')
plt.show()

n_cluster = 10

In [33]:
kmeansClustering10 = KMeans(n_clusters=10).fit_predict(data)
In [34]:
f = plt.figure(figsize=(20,10))
ax = f.add_subplot(121)
ax.scatter(x_components[:,0], x_components[:,1], c=kmeansClustering10, cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('clustering based on kMeans (using TSNE)')
ax2 = f.add_subplot(122)
ax2.scatter(x_components[:,0], x_components[:,1], c=list(pub_toks.best_topic), cmap=plt.cm.jet, s=8)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')
plt.show()

DBScan

In [63]:
from sklearn.cluster import DBSCAN
In [62]:
dbscan_pub = DBSCAN().fit(data)
In [64]:
dbscan_clusters = dbscan_pub.labels_
n_clusters_ = len(set(dbscan_clusters)) - (1 if -1 in dbscan_clusters else 0)
print 'number of clusters:', n_clusters_
number of clusters: 1
In [65]:
dbscan_pub = DBSCAN().fit(x_components)
In [66]:
dbscan_clusters = dbscan_pub.labels_
n_clusters_ = len(set(dbscan_clusters)) - (1 if -1 in dbscan_clusters else 0)
print 'number of clusters:', n_clusters_
number of clusters: 203
In [170]:
f = plt.figure(figsize=(20, 10))
ax = f.add_subplot(121)
color_palette = sns.color_palette('Paired', 203)
cluster_colors = [
    color_palette[x] if x >= 0 else (0.5, 0.5, 0.5) for x in dbscan_clusters
]
ax.scatter(
    x_components[:, 0], x_components[:, 1], c=cluster_colors, s=50, alpha=.25)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('clustering based on hdbscan (using TSNE)')
ax2 = f.add_subplot(122)
ax2.scatter(
    x_components[:, 0],
    x_components[:, 1],
    c=list(pub_toks.best_topic),
    cmap=plt.cm.jet,
    s=50,
    alpha=.25)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')
plt.show()
In [220]:
dbscan_pub_mds = DBSCAN().fit(mds_components)
In [221]:
dbscan_pub_mds_clusters = dbscan_pub_mds.labels_
n_clusters_ = len(set(dbscan_pub_mds_clusters)) - (1 if -1 in dbscan_pub_mds_clusters else 0)
print 'number of clusters:', n_clusters_
number of clusters: 1

HDBSCAN

In [83]:
from hdbscan import HDBSCAN
import hdbscan
In [41]:
hdbscan_cluster = HDBSCAN().fit(data)
In [44]:
n_clusters_ = len(set(hdbscan_cluster.labels_)) - (1 if -1 in hdbscan_cluster.labels_ else 0)
print 'number of clusters:', n_clusters_
number of clusters: 102
In [60]:
f = plt.figure()
ax = f.add_subplot(111)
ax.hist(hdbscan_cluster.labels_,bins=50);
plt.show()
In [169]:
f = plt.figure(figsize=(20, 10))
ax = f.add_subplot(121)
color_palette = sns.color_palette('Paired', 102)
cluster_colors = [
    color_palette[x] if x >= 0 else (0.5, 0.5, 0.5)
    for x in hdbscan_cluster.labels_
]
#  colors weighted according to te probability of being in the cluster
cluster_member_colors = [
    sns.desaturate(x, p)
    for x, p in zip(cluster_colors, hdbscan_cluster.probabilities_)
]
ax.scatter(
    *x_components.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

# ax.scatter(x_components[:,0], x_components[:,1], c=hdbscan_cluster.labels_, cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('clustering based on HDBSCAN (using TSNE)')
ax2 = f.add_subplot(122)
ax2.scatter(
    x_components[:, 0],
    x_components[:, 1],
    c=list(pub_toks.best_topic),
    cmap=plt.cm.jet,
    s=50,
    alpha=.25)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')
plt.show()

Soft clustering in HDBSCAN*

In [78]:
clusterer = HDBSCAN(prediction_data=True).fit(data)
In [79]:
n_clusters_ = len(set(hdbscan_cluster.labels_)) - (1 if -1 in hdbscan_cluster.labels_ else 0)
print 'number of clusters:', n_clusters_
number of clusters: 102
In [81]:
f = plt.figure()
ax = f.add_subplot(111)
ax.hist(clusterer.labels_,bins=50);
plt.show()
In [168]:
f = plt.figure(figsize=(20, 20))
ax = f.add_subplot(221)

soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
color_palette = sns.color_palette('Paired', 102)
cluster_colors = [color_palette[np.argmax(x)] for x in soft_clusters]
ax.scatter(*x_components.T, s=50, linewidth=0, c=cluster_colors, alpha=0.25)

# ax.scatter(x_components[:,0], x_components[:,1], c=hdbscan_cluster.labels_, cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('soft clustering based on HDBSCAN (using TSNE)')
ax2 = f.add_subplot(222)
ax2.scatter(
    x_components[:, 0],
    x_components[:, 1],
    c=list(pub_toks.best_topic),
    cmap=plt.cm.jet,
    s=50,
    alpha=.25)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')

ax3 = f.add_subplot(223)
cluster_colors = [
    sns.desaturate(color_palette[np.argmax(x)], np.max(x))
    for x in soft_clusters
]
ax3.scatter(*x_components.T, s=50, linewidth=0, c=cluster_colors, alpha=0.25)
ax3.axis('off')
ax3.set_title(
    'soft clustering based on HDBSCAN, desaturated based on cluster probability (using TSNE)'
)
plt.show()

Topic Distribution of PoInf

In [92]:
# Create a new pandas table that merge the lookup_poinf and lookup_pub
# each pub in lookup_pub have a collab_id that have a list of collaborators by id
# We can ignore those that are not in the list of id for PoInf

# Create the list of ids for easy checking:
poinf_id = set(lookup_poinf.index)

# we can now create such an index:
pub_mapping = {str(_id):set() for _id in list(poinf_id)}

for row in lookup_pub.iterrows():
    pub_id = row[0]
    collab_ids = row[1]['collab_id']
    for _id in collab_ids:
        if _id in poinf_id:
            pub_mapping[_id].add(pub_id)
In [93]:
row_list = [{'id':k, 'pub_ids':v} for (k,v) in pub_mapping.items()]
In [94]:
# Add these pub_ids to the pandas df:
df_pubmapping = pd.DataFrame(row_list)
In [95]:
lookup_poinf_more = lookup_poinf.join(df_pubmapping.set_index('id'))
In [37]:
lookup_poinf_more.iloc[20:24]
Out[37]:
last_name first_name perseonal_url position parent institute full_name institute_class alias pub_ids toks
id
0ed800f5-a3a0-47d7-a8b3-f97a4f2b6931 steuwer michel http://www.research.ed.ac.uk/portal/en/persons... unknown institute for computing systems architecture laboratory for foundations of computer science steuwer michel 3 steuwer, m. {20cb2fdd-6d93-40b9-9cab-e9d818eb166e, b74a3be... [[highlevel, program, medic, imag, multigpu, s...
102286ee-5f21-4aed-abfd-e4ea1a615223 oberlander jon http://www.research.ed.ac.uk/portal/en/persons... professor school of informatics institute of language cognition and computation oberlander jon 2 oberlander, j. {5be3a6b1-5ee4-4a39-9fff-88b22238fb98, 4629e88... [[verbal, effect, visual, program, inform, typ...
10ff8e7a-53b2-4d2f-adad-ef695bc595a7 wen zhenyu http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics laboratory for foundations of computer science wen zhenyu 3 wen, z. {872e450f-87e9-4956-9678-5b09f3cd4f84, a4cad99... [[cost, effici, schedul, mapreduc, applic, pub...
114a78ef-c940-4653-8429-7c2897a96043 jovanovic jelena http://www.research.ed.ac.uk/portal/en/persons... visitor official visitor school of informatics institute of language cognition and computation jovanovic jelena 2 jovanovic, j. {b4b0b45d-72d9-4f39-929c-ea451288f253} [[analyticsbas, framework, support, teach, lea...
In [96]:
def getToks(pub_ids):
    out = []
    try:
        if len(pub_ids):
            for pub_id in pub_ids:
                out.extend(pub_toks[pub_toks.index == pub_id].summary_toks)
            # Convert the list of lists to a single list:
            out = [tok for tokList in out for tok in tokList]
    except TypeError:
        print(pub_ids)
        
    return out
In [97]:
lookup_poinf_more['summary_toks'] = lookup_poinf_more.apply(lambda row: getToks(row.pub_ids), axis=1)
nan
nan
nan
nan
In [98]:
# Conert to BOW using bowified:
lookup_poinf_more['bow'] = lookup_poinf_more.apply(bowified, axis=1)
In [99]:
lookup_poinf_more['topic_distribution'] = lookup_poinf_more.apply(_inference, axis=1)
In [51]:
lookup_poinf_more.head(2)
Out[51]:
last_name first_name perseonal_url position parent institute full_name institute_class alias pub_ids toks summary_toks bow topic_distribution
id
003ec9bb-18aa-4e6e-95e9-359f0968262a gray gavin http://www.research.ed.ac.uk/portal/en/persons... research assistant school of informatics institute for computing systems architecture gray gavin 5 NaN {} [] [] [] [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.0...
010f9bf0-c04c-4cfb-ab3d-ca150de1e706 jackson paul http://www.research.ed.ac.uk/portal/en/persons... senior lecturer school of informatics institute for computing systems architecture jackson paul 5 jackson, p. b.|jackson, p. {c5754b06-fcf9-4362-aa3a-1142589b5402, 167c4b6... [nuprl, use, circuit, design, nuprl, interact,... [nuprl, use, circuit, design, nuprl, interact,... [(22, 2), (27, 1), (29, 15), (39, 1), (40, 1),... [0, 0.326083956289, 0.120584281776, 0.05047796...
In [134]:
len(lookup_poinf_more)
Out[134]:
296
In [101]:
lookup_poinf_more['remove_drop'] = lookup_poinf_more.apply(lambda row: len(row.bow) == 0, axis=1)
In [102]:
# Remove individuals that does not have any bow:
lookup_poinf_more_drop = lookup_poinf_more.drop(lookup_poinf_more[lookup_poinf_more.remove_drop==True].index)
In [103]:
len(lookup_poinf_more_drop)
Out[103]:
219
In [135]:
lookup_poinf_more_drop['best_topic'] = lookup_poinf_more_drop.apply(lambda row: best_topic(row.topic_distribution), axis=1)
/Users/weiting/miniconda3/envs/py27/lib/python2.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [154]:
lookup_poinf_more_drop.head(2)
Out[154]:
last_name first_name perseonal_url position parent institute full_name institute_class alias pub_ids toks summary_toks bow topic_distribution remove_drop best_topic
id
010f9bf0-c04c-4cfb-ab3d-ca150de1e706 jackson paul http://www.research.ed.ac.uk/portal/en/persons... senior lecturer school of informatics institute for computing systems architecture jackson paul 5 jackson, p. b.|jackson, p. {c5754b06-fcf9-4362-aa3a-1142589b5402, 167c4b6... [nuprl, use, circuit, design, nuprl, interact,... [nuprl, use, circuit, design, nuprl, interact,... [(22, 2), (27, 1), (29, 15), (39, 1), (40, 1),... [0, 0.326083956289, 0.120584281776, 0.05047796... False 1
02c86de2-0fc9-4f6d-aee9-93b0f7557c84 franke bjoern http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics institute of language cognition and computation franke bjoern 2 franke, b. {9a3368cc-e69d-4ecf-bad1-b43ab0ac89a8, ab3fccd... [use, genet, program, sourcelevel, data, assig... [use, genet, program, sourcelevel, data, assig... [(1, 1), (3, 5), (6, 2), (10, 1), (11, 1), (13... [0.0467349416537, 0, 0.0134058642663, 0.685572... False 3

Clustering

In [136]:
## Our dataset will be the topic_distribution:
data_poinf = lookup_poinf_more_drop.topic_distribution.values
In [137]:
data_poinf = list(data_poinf)
In [139]:
poinf_tsne = manifold.TSNE(n_components=2, init='pca', random_state=rng).fit_transform(data_poinf)

Visualisation

TSNE

In [140]:
f = plt.figure(figsize=(10, 10))
ax = f.add_subplot(111)
ax.scatter(
    poinf_tsne[:, 0],
    poinf_tsne[:, 1],
    c=list(lookup_poinf_more_drop.best_topic),
    cmap=plt.cm.jet,
    s=50,
    alpha=.5)
ax.legend(loc='best')
ax.axis('off')
plt.show()
In [142]:
poinf_tsne.shape
Out[142]:
(219, 2)

MDS

In [115]:
# Multidimensional Scaling
mds_poinf = manifold.MDS(n_components=2, random_state=rng).fit_transform(data_poinf)
In [119]:
f = plt.figure(figsize=(10,10))
ax = f.add_subplot(111)
ax.scatter(mds_poinf[:,0], mds_poinf[:,1], c=list(lookup_poinf_more_drop.best_topic), cmap=plt.cm.jet, s=50, alpha=.5)
ax.legend(loc='best')
ax.axis('off')
plt.show()

K means

In [143]:
kmeansClustering = KMeans(n_clusters=20).fit_predict(data_poinf)
In [144]:
f = plt.figure(figsize=(20, 10))
ax = f.add_subplot(121)
ax.scatter(
    poinf_tsne[:, 0],
    poinf_tsne[:, 1],
    c=kmeansClustering,
    cmap=plt.cm.jet,
    s=50,
    alpha=.5)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('clustering based on kMeans (using TSNE)')
ax2 = f.add_subplot(122)
ax2.scatter(
    poinf_tsne[:, 0],
    poinf_tsne[:, 1],
    c=list(lookup_poinf_more_drop.best_topic),
    cmap=plt.cm.jet,
    s=50,
    alpha=.5)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')
plt.show()
In [123]:
f = plt.figure(figsize=(20, 10))
ax = f.add_subplot(121)
ax.scatter(
    mds_poinf[:, 0],
    mds_poinf[:, 1],
    c=kmeansClustering,
    cmap=plt.cm.jet,
    s=50,
    alpha=.5)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('Clustering baed on kMeans (mds)')

ax2 = f.add_subplot(122)
ax2.scatter(
    mds_poinf[:, 0],
    mds_poinf[:, 1],
    c=list(lookup_poinf_more_drop.best_topic),
    cmap=plt.cm.jet,
    s=50,
    alpha=.5)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('Clustering based on best_topic (mds)')
plt.show()
plt.show()

DBScan

In [190]:
dbscan = DBSCAN(min_samples=1, algorithm='ball_tree', leaf_size=2).fit(data_poinf)
In [191]:
dbscan_clusters = dbscan.labels_
n_clusters_ = len(set(dbscan_clusters)) - (1 if -1 in dbscan_clusters else 0)
print 'number of clusters:', n_clusters_
In [193]:
dbscan_clusters
Out[193]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [197]:
# Try with the tsne data:
dbscan_tsne = DBSCAN().fit(x)
In [198]:
dbscan_tsne_clusters = dbscan_tsne.labels_
In [199]:
n_clusters_tsne = len(set(dbscan_tsne_clusters)) - (1 if -1 in dbscan_tsne_clusters else 0)
print 'number of clusters:', n_clusters_
number of clusters: 1

HDBSCAN

In [124]:
hdbscan_cluster_poinf = HDBSCAN().fit(data_poinf)
In [125]:
n_clusters_ = len(set(hdbscan_cluster_poinf.labels_)) - (1 if -1 in hdbscan_cluster_poinf.labels_ else 0)
print 'number of clusters:', n_clusters_


f = plt.figure()
ax = f.add_subplot(111)
ax.hist(hdbscan_cluster_poinf.labels_,bins=50);
plt.show()
number of clusters: 2
In [145]:
f = plt.figure(figsize=(20,10))
ax = f.add_subplot(121)
color_palette = sns.color_palette('husl', 103)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in hdbscan_cluster_poinf.labels_]
#  colors weighted according to te probability of being in the cluster
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, hdbscan_cluster_poinf.probabilities_)] 
ax.scatter(*poinf_tsne.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

# ax.scatter(x_components[:,0], x_components[:,1], c=hdbscan_cluster.labels_, cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('clustering based on HDBSCAN (using TSNE)')

ax2 = f.add_subplot(122)
ax2.scatter(poinf_tsne[:,0], poinf_tsne[:,1], c=list(lookup_poinf_more_drop.best_topic), cmap=plt.cm.jet, s=50, alpha=.5)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')

plt.show()

Soft Clustering of poInf

In [173]:
cluster_poinf = HDBSCAN(prediction_data=True).fit(data_poinf)
In [174]:
n_clusters_ = len(set(cluster_poinf.labels_)) - (1 if -1 in cluster_poinf.labels_ else 0)
print 'number of clusters:', n_clusters_

f = plt.figure()
ax = f.add_subplot(111)
ax.hist(cluster_poinf.labels_,bins=50);
plt.show()
number of clusters: 2
In [175]:
f = plt.figure(figsize=(16, 16))
ax = f.add_subplot(221)

soft_clusters = hdbscan.all_points_membership_vectors(cluster_poinf)
color_palette = sns.color_palette('husl', 2)
cluster_colors = [color_palette[np.argmax(x)] for x in soft_clusters]
ax.scatter(*poinf_tsne.T, s=70, linewidth=0, c=cluster_colors, alpha=0.5)
# ax.scatter(x_components[:,0], x_components[:,1], c=hdbscan_cluster.labels_, cmap=plt.cm.jet, s=8)
ax.legend(loc='best')
ax.axis('off')
ax.set_title('soft clustering based on HDBSCAN (using TSNE)')

ax2 = f.add_subplot(222)
ax2.scatter(
    poinf_tsne[:, 0],
    poinf_tsne[:, 1],
    c=list(lookup_poinf_more_drop.best_topic),
    cmap=plt.cm.jet,
    s=70,
    alpha=.5)
ax2.legend(loc='best')
ax2.axis('off')
ax2.set_title('clustering based on best topic')

ax3 = f.add_subplot(223)
cluster_colors = [
    sns.desaturate(color_palette[np.argmax(x)], np.max(x))
    for x in soft_clusters
]
ax3.scatter(*poinf_tsne.T, s=70, linewidth=0, c=cluster_colors, alpha=0.5)
ax3.axis('off')
ax3.set_title(
    'soft clustering based on HDBSCAN, desaturated based on cluster probability (using TSNE)'
)
plt.show()
In [176]:
soft_clusters.shape
Out[176]:
(219, 2)
In [182]:
classDef = np.argmax(soft_clusters,axis=1)
In [184]:
class1 = classDef[classDef==0]
In [185]:
class2 = classDef[classDef==1]
In [190]:
lookup_poinf_more_drop.iloc[classDef==1]
Out[190]:
last_name first_name perseonal_url position parent institute full_name institute_class alias pub_ids summary_toks bow topic_distribution remove_drop best_topic
id
010f9bf0-c04c-4cfb-ab3d-ca150de1e706 jackson paul http://www.research.ed.ac.uk/portal/en/persons... senior lecturer school of informatics institute for computing systems architecture jackson paul 5 jackson, p. b.|jackson, p. {c5754b06-fcf9-4362-aa3a-1142589b5402, 167c4b6... [nuprl, use, circuit, design, nuprl, interact,... [(22, 2), (27, 1), (29, 15), (39, 1), (40, 1),... [0, 0.326083956289, 0.120584281776, 0.05047796... False 1
0346dc9e-e2a7-4523-8504-a74ef42a533b pullinger martin http://www.research.ed.ac.uk/portal/en/persons... senior researcher school of informatics laboratory for foundations of computer science pullinger martin 3 pullinger, m. {bdd3bcc2-1420-43ea-8c66-c2f3218f4dcb, 2428dec... [work, time, reduct, polici, sustain, economi,... [(14, 2), (22, 2), (24, 5), (38, 1), (45, 1), ... [0.368909432042, 0, 0.0119890660601, 0.0122734... False 0
03916cbc-3a54-4de4-be54-09c23f44dbb5 kalorkoti k http://www.research.ed.ac.uk/portal/en/persons... senior lecturer school of informatics laboratory for foundations of computer science kalorkoti k 3 kalorkoti, k.|kalorkoti, k. a. {1d11e0dd-919d-41da-8209-6d3e68c8e5fb, 1ad283d... [invert, polynomi, formal, power, seri, proble... [(4, 3), (6, 4), (44, 13), (58, 1), (80, 2), (... [0, 0.463266002725, 0.0259881837671, 0.0259327... False 1
0505dfc1-9fe5-4f4a-bab3-4af14ee69db3 goryanin igor http://www.research.ed.ac.uk/portal/en/persons... chair of systems biology school of informatics neuroinformatics dtc goryanin igor 6 goryanin, i.|goryanin, i. i. {5b8afa54-7f4b-4e68-9ab9-50899baf81f4, 8838f81... [semiautom, genom, annot, comparison, integr, ... [(3, 8), (11, 2), (14, 1), (16, 1), (23, 2), (... [0.39367571095, 0.013973154296, 0.015279689953... False 0
053590d0-39d7-4a42-b42d-61ee8d743d3e goddard nigel http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics institute of language cognition and computation goddard nigel 2 goddard, n.|goddard, n. h. {10c5a591-daea-4476-9cfe-f5729d1867d5, 2969d33... [increment, modelbas, discrimin, articul, move... [(2, 1), (11, 1), (14, 3), (16, 6), (22, 1), (... [0.14540591874, 0, 0, 0.0753922225709, 0.03718... False 19
05ed47ac-4c5e-4f9a-b7b2-45828eaad326 wallden petros http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics UNKNOWN wallden petros 0 wallden, p. {41422804-5cc2-4b53-bb8b-244b25e59867, b87ec5b... [robust, devic, independ, verifi, blind, quant... [(6, 1), (11, 6), (14, 28), (17, 2), (22, 1), ... [0, 0.423894446708, 0.0374311210368, 0.0178002... False 1
06760916-edca-488e-93da-baff6204a453 marina mahesh http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics neuroinformatics dtc marina mahesh 6 marina, m.|marina, m. k. {6307436e-528c-4313-a673-67d01edd2558, d9f7969... [virtual, dynam, backbon, mobil, ad, hoc, netw... [(0, 1), (3, 4), (11, 4), (14, 2), (15, 1), (1... [0.393498691038, 0, 0.024713294842, 0.09819444... False 0
087adc6d-e04b-4168-8824-2fa69f6b39e7 vaniea kami http://www.research.ed.ac.uk/portal/en/persons... lecturer in cyber security and privacy school of informatics laboratory for foundations of computer science vaniea kami 3 vaniea, k. e.|vaniea, k. {b32513b0-8461-49f6-8542-463e9c2b4e67, 8f66327... [loop, autom, softwar, updat, caus, unintend, ... [(3, 1), (11, 1), (14, 1), (16, 1), (17, 1), (... [0.0663397136443, 0, 0.0494119595467, 0, 0, 0.... False 5
09607fb1-44fb-4edc-a0ae-c54d3047be30 court robert http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute for adaptive and neural computation court robert 4 court, r.|court, r. c. {a771634e-3e03-4803-b5e8-722cf2e05b76, 7b82c40... [virtual, fli, brain, use, owl, support, map, ... [(49, 1), (79, 1), (96, 3), (165, 1), (197, 1)... [0, 0, 0.0351173122818, 0, 0, 0, 0, 0.05600933... False 11
09931c59-595a-487a-a38f-2b28fdc4e406 mayr richard http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics institute of perception action and behaviour mayr richard 7 mayr, r. {95f42e94-fc6c-4df6-b2c3-9a2ee163f295, c659192... [simul, onecount, net, pspacecomplet, one, cou... [(14, 1), (15, 6), (23, 1), (24, 8), (25, 1), ... [0.0214236608634, 0.440248555178, 0.0800491595... False 1
0cf9e165-3552-465d-8ce0-a3c8e56a0f77 grundkiewicz roman http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute of language cognition and computation grundkiewicz roman 2 grundkiewicz, r. {81f4d80d-1b10-48ef-9496-fb1e0931eaeb, 06132d6... [reinvestig, classifi, approach, articl, prepo... [(6, 1), (11, 1), (43, 1), (45, 2), (47, 1), (... [0, 0, 0.0700861393112, 0.0372858194772, 0, 0,... False 16
0d60e141-a185-4709-a2c0-b18af3a1b3e4 morgan evan http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute for adaptive and neural computation morgan evan 4 morgan, e. {b4f509a1-2619-4d89-a2e6-5798e8d8241d, 4d1e170... [enhanc, assembl, room, energi, live, lab, ple... [(11, 1), (14, 1), (23, 10), (24, 1), (45, 1),... [0.26265294627, 0, 0.052238031782, 0, 0, 0.162... False 0
0d66e947-585b-4eb0-975d-e8c2e2dc96b1 hartswood mark http://www.research.ed.ac.uk/portal/en/persons... unknown institute of language cognition and computation institute of language cognition and computation hartswood mark 2 hartswood, m. {78d4dbb5-0c81-4f69-b885-6a7db1f2414a, de681e1... [depend, red, hot, action, present, brief, obs... [(3, 3), (6, 1), (11, 1), (14, 5), (15, 1), (1... [0.120897442906, 0, 0.0461391874941, 0, 0.0315... False 19
0e635bd1-f0fb-4f16-a68a-d0457b80eed7 williams christopher http://www.research.ed.ac.uk/portal/en/persons... personal chair chair of machine learning school of informatics school of philosophy psychology and language s... williams christopher 10 williams, c.|williams, c. k. i. {3978b46a-856e-4727-9117-51101337d880, 053a9c0... [upper, bound, bayesian, error, bar, general, ... [(0, 1), (2, 2), (3, 2), (6, 1), (10, 3), (11,... [0.0470167468532, 0.0176632008886, 0.018042581... False 8
102286ee-5f21-4aed-abfd-e4ea1a615223 oberlander jon http://www.research.ed.ac.uk/portal/en/persons... professor school of informatics institute of language cognition and computation oberlander jon 2 oberlander, j. {5be3a6b1-5ee4-4a39-9fff-88b22238fb98, 4629e88... [verbal, effect, visual, program, inform, type... [(3, 4), (6, 2), (8, 7), (11, 5), (14, 9), (18... [0.0233073508681, 0.0257016683519, 0.012807107... False 9
10ff8e7a-53b2-4d2f-adad-ef695bc595a7 wen zhenyu http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics laboratory for foundations of computer science wen zhenyu 3 wen, z. {872e450f-87e9-4956-9678-5b09f3cd4f84, a4cad99... [cost, effici, schedul, mapreduc, applic, publ... [(0, 2), (16, 1), (40, 1), (45, 3), (64, 1), (... [0.0804057040473, 0.0285247038238, 0, 0.193369... False 19
114a78ef-c940-4653-8429-7c2897a96043 jovanovic jelena http://www.research.ed.ac.uk/portal/en/persons... visitor official visitor school of informatics institute of language cognition and computation jovanovic jelena 2 jovanovic, j. {b4b0b45d-72d9-4f39-929c-ea451288f253} [analyticsbas, framework, support, teach, lear... [(579, 1), (1036, 2), (1536, 1), (1582, 1), (1... [0, 0, 0, 0, 0, 0.180330494479, 0, 0, 0.354561... False 8
1458e0ed-f765-4bf7-a151-b3201e5a8ae8 klein ewan http://www.research.ed.ac.uk/portal/en/persons... unknown institute of language cognition and computation UNKNOWN klein ewan 0 klein, e. {8e9d233f-0e3c-4cc4-90bf-6e07e8240496, ddaf907... [extens, toolkit, comput, semant, paper, focus... [(0, 1), (3, 1), (6, 1), (8, 1), (14, 7), (15,... [0.035719560889, 0, 0.0438140937098, 0.0223373... False 19
15b544ff-f14e-4393-bee6-0aa38f4361b6 steedman mark http://www.research.ed.ac.uk/portal/en/persons... professor school of informatics laboratory for foundations of computer science steedman mark 3 steedman, m.|steedman, m. j. {c2005ff2-83e8-4145-97bb-00f0ef49d5fc, 723fff7... [comput, grammar, acquisit, child, data, use, ... [(1, 5), (3, 6), (6, 9), (9, 6), (10, 1), (11,... [0, 0.0681410156983, 0.0227942233463, 0.011471... False 9
1997d4be-de27-44bd-ad89-36e6f386225d alexandru cristina http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics centre for intelligent systems and their appli... alexandru cristina 1 alexandru, c.|alexandru, c. a.|alexandru, c-a. {1b97dc9d-169e-466b-90b9-0ae44c733453, ebe0d59... [design, social, machin, heart, manual, servic... [(22, 1), (23, 1), (26, 1), (52, 1), (77, 1), ... [0.190006252313, 0, 0.0103090378398, 0.0355267... False 19
1a05e95b-6e1a-40d8-8406-d5e62b7c722d banks chris http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics laboratory for foundations of computer science banks chris 3 banks, c. j.|banks, c. {aece86e1-ac7e-4bd9-8065-0e6386906749, 03e6525... [function, transcript, factor, target, discove... [(29, 4), (35, 1), (49, 2), (82, 1), (90, 4), ... [0.228395371221, 0.287398801975, 0.04244300763... False 1
1a776ed4-3d64-416c-b80b-f84dbe849899 karaiskos vasilios http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute for adaptive and neural computation karaiskos vasilios 4 karaiskos, v. {561bf16e-ba10-4f45-8034-5ef8bd3f4094, 88be6fe... [toward, graphophonolog, pars, corpus, mediev,... [(9, 2), (38, 1), (40, 1), (47, 7), (50, 2), (... [0.0214511848851, 0, 0.0972912745996, 0, 0.037... False 5
1b064412-0109-4090-a60f-bafcbbad74be kilgour jonathan http://www.research.ed.ac.uk/portal/en/persons... research fellow school of informatics institute of language cognition and computation kilgour jonathan 2 kilgour, j. {398c1e16-d5ee-4aa3-b003-f0b893312379, 498b542... [automat, content, link, speechbas, justintim,... [(3, 2), (6, 1), (11, 1), (18, 5), (23, 2), (4... [0.0278806644074, 0, 0.0312487115957, 0.017780... False 5
1bbcff05-f819-4c25-a2a0-a03d793b788c shillcock richard http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics UNKNOWN shillcock richard 0 shillcock, r. c.|shillcock, r.|shillcock, r. ... {4eea78cd-5c0f-47d8-a3d6-192aeb0f1bcf, f8258c6... [competitor, effect, lexic, access, chase, zip... [(3, 6), (6, 1), (9, 26), (11, 1), (14, 1), (1... [0, 0, 0, 0.0108369309839, 0.0348145296628, 0.... False 9
1c6a1e04-4291-426e-a921-42be34ba8494 cheney james http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics laboratory for foundations of computer science cheney james 3 cheney, j. (ed.)|cheney, j.|cheney, j. r. {d502cba7-e4fe-4227-8bfe-82cea04ea393, b325d9c... [causal, semant, proven, proven, inform, sourc... [(3, 1), (6, 3), (8, 1), (11, 4), (14, 12), (1... [0, 0.151990010215, 0.0457461278132, 0.0348077... False 18
1d499fa8-54c4-4217-904e-9b5e2bcefc61 chen wei http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics laboratory for foundations of computer science chen wei 3 chen, w. {7f45bf9b-583a-4814-b280-9675457cb0df, 1453545... [robust, malwar, classifi, verifi, unwant, beh... [(29, 5), (46, 2), (67, 2), (70, 5), (79, 1), ... [0.112776737498, 0.214646332241, 0.08889140549... False 1
204097e4-072c-4df3-bb1d-f45380908692 haddow barry http://www.research.ed.ac.uk/portal/en/persons... senior research fellow school of informatics UNKNOWN haddow barry 0 haddow, b. {e390989e-217c-4180-b20c-202e50c3af96, b67b7c8... [exploit, multipli, annot, corpora, biomed, in... [(3, 1), (9, 1), (12, 2), (16, 5), (27, 1), (4... [0.0135148599378, 0, 0.0147301589719, 0.016830... False 16
213fbdf4-a67d-4b06-8614-3b864f740266 bradfield julian http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics laboratory for foundations of computer science bradfield julian 3 bradfield, j. c.|bradfield, j.|bradfield, j. ... {d558ba90-2dc4-44ad-895a-983625c7e9e8, 8cb8582... [fixpoint, game, differ, hierarchi, draw, anal... [(6, 2), (9, 10), (11, 2), (15, 1), (20, 1), (... [0, 0.69609280002, 0.082707908312, 0, 0, 0, 0.... False 1
23541f3f-42a3-4637-a43f-603d8bf9372b petillot yvan http://www.research.ed.ac.uk/portal/en/persons... visitor official visitor school of informatics laboratory for foundations of computer science petillot yvan 3 petillot, y. {baeb926c-023c-4dac-ad68-3bfc27b4505d, bf0268e... [direct, visual, slam, fuse, propriocept, huma... [(63, 1), (90, 1), (130, 1), (157, 1), (164, 4... [0.0116314564844, 0, 0, 0.0258784428589, 0.260... False 15
2374a5c7-54ec-4513-9c13-a3787724420b sorokina oksana http://www.research.ed.ac.uk/portal/en/persons... senior researcher school of informatics laboratory for foundations of computer science sorokina oksana 3 sorokina, o. v.|sorokina, o. {ac5ee121-76f4-4b2d-a7b9-d070a132e99a, 8946ed0... [system, biolog, approach, parkinson, diseas, ... [(11, 2), (17, 2), (23, 1), (24, 3), (40, 6), ... [0.216145685754, 0.0187393888104, 0.0506680404... False 11
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
d9cfbd11-051e-4b9e-8120-b97486cc4263 eshky aciel http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics laboratory for foundations of computer science eshky aciel 3 eshky, a. {56cff791-c8d9-47c5-88d2-5995691ceb37, 123944f... [generat, model, user, simul, spatial, navig, ... [(11, 1), (37, 1), (77, 1), (114, 1), (184, 1)... [0.116499541731, 0, 0, 0, 0, 0.0799424180557, ... False 8
da3124d7-44f9-4199-82bf-9e964f422ab9 armstrong douglas http://www.research.ed.ac.uk/portal/en/persons... personal chair in systems neurobiology school of informatics institute for adaptive and neural computation armstrong douglas 4 armstrong, d. g.|armstrong, d. j.|armstrong, d... {7be45170-0d7c-44d9-96c6-0f3c6dbc2c69, 31bc88b... [element, nonelement, olfactori, learn, drosop... [(3, 2), (11, 14), (15, 1), (17, 3), (23, 10),... [0.154212425712, 0, 0.0134456324792, 0, 0.0364... False 11
dbb07214-3753-4bba-a630-e1857e7efb09 arvind d k http://www.research.ed.ac.uk/portal/en/persons... chair of distributed wireless computation school of informatics institute for computing systems architecture arvind d k 5 arvind, d.|arvind, d. k. {32f25fcc-3094-4b83-8602-36d7654bbc3b, 4955cb1... [detect, communicationrel, error, concurr, pro... [(1, 1), (3, 7), (11, 1), (14, 4), (15, 1), (1... [0.296709631275, 0.0159709134546, 0.0355353842... False 0
dde24035-cf29-4dab-aeff-3696dd140d70 sindaci martino sorbaro http://www.research.ed.ac.uk/portal/en/persons... unknown neuroinformatics dtc institute for adaptive and neural computation sindaci martino sorbaro 4 sindaci, m. s. {277a9dfd-f157-4e34-bf15-e5f85e0acafe, 01fa661... [unsupervis, spike, sort, larg, scale, high, d... [(40, 1), (92, 6), (118, 1), (158, 1), (213, 1... [0.181096826706, 0, 0, 0.0924725310858, 0.0813... False 17
dde6eac2-ecfc-40da-84e3-f0355547c99b selega alina http://www.research.ed.ac.uk/portal/en/persons... unknown neuroinformatics dtc institute for adaptive and neural computation selega alina 4 selega, a. {3c57e688-6a1b-4362-bf19-6e5e1646c22f, f6a0daa... [trend, challeng, comput, rna, biolog, report,... [(1, 1), (24, 1), (49, 3), (67, 1), (87, 1), (... [0.428167914563, 0, 0, 0.028343233726, 0, 0.15... False 0
dfd98db5-9b1d-43d3-ad29-8a584195cdb8 storkey amos http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics UNKNOWN storkey amos 0 j. storkey, a.|storkey, a. j.|storkey, a. {474d1aa2-a24e-412c-84c7-97e66eb85e21, b178ff5... [dataset, shift, machin, learn, dataset, shift... [(0, 1), (3, 4), (5, 1), (11, 7), (15, 1), (16... [0.0702817319263, 0.0119366454002, 0.026451055... False 8
dfee7091-0d56-4b1e-a821-5f0f0b37ea02 simpson ian http://www.research.ed.ac.uk/portal/en/persons... research fellow school of informatics centre for intelligent systems and their appli... simpson ian 1 simpson, i. t.|simpson, t. i.|simpson, t.|simp... {cf5f6e97-68ab-480e-94a1-f37971379ae5, 9edd074... [communiti, detect, identifi, subnetwork, syna... [(6, 1), (11, 3), (14, 1), (17, 1), (20, 4), (... [0.174445103154, 0, 0, 0, 0.0289257627289, 0.0... False 11
e111bbd0-4d9c-45a9-8a1c-6ed33c871568 stark ian http://www.research.ed.ac.uk/portal/en/persons... senior lecturer school of informatics institute for computing systems architecture stark ian 5 stark, i. d. b.|stark, i.|stark, i. (ed.) {6c660b1a-7096-4dbc-9e86-23d7c870ceea, 44cde56... [continu, calculus, process, algebra, biochem,... [(6, 1), (15, 1), (29, 3), (35, 1), (39, 1), (... [0.0875410057191, 0.302670075625, 0.0586301087... False 1
e346d85f-17fb-4bfb-b94a-7e1891987ac2 keller frank http://www.research.ed.ac.uk/portal/en/persons... personal chair in congnitive science school of informatics laboratory for foundations of computer science keller frank 3 keller, f. {3bfae656-9ca1-4252-8a50-0cc2593385b0, bbfa696... [scan, pattern, predict, sentenc, product, cro... [(0, 1), (3, 3), (9, 1), (11, 2), (15, 3), (22... [0, 0, 0.0226011755965, 0.0134661752633, 0.160... False 9
e4fface3-9781-4bfa-9a40-9340f707cde3 arapinis myrto http://www.research.ed.ac.uk/portal/en/persons... lecturer school of informatics laboratory for foundations of computer science arapinis myrto 3 arapinis, m. {61ac6189-a231-4413-9642-e377cfab6079, 70b13aa... [one, session, mani, dynam, tag, secur, protoc... [(5, 1), (7, 5), (15, 1), (16, 2), (29, 15), (... [0.0478941927647, 0.232807402501, 0.0598030057... False 10
e554cc1b-7106-4355-afce-280d0a04e34a etessami kousha http://www.research.ed.ac.uk/portal/en/persons... personal chair in algorithms games logic and c... school of informatics UNKNOWN etessami kousha 0 etessami, k. {42017477-ca12-4bf0-8e31-4193374acfd1, f9ae20e... [reachabl, power, local, order, l, l, nl, nl, ... [(3, 1), (6, 4), (11, 3), (12, 1), (15, 2), (2... [0, 0.577181303942, 0.0744995509902, 0.0240366... False 1
e76a7c40-a177-4eb5-9ac3-12ab104895f8 perera roland http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute for computing systems architecture perera roland 5 perera, r. {c9195814-962d-4414-bae1-7815acda7cce, 379f5ac... [typecheck, protocol, mungo, stmungo, report, ... [(16, 1), (22, 1), (29, 2), (40, 1), (56, 1), ... [0, 0.232714730994, 0.0415418383123, 0.0500676... False 1
ea253c8d-bdc3-47fa-9091-8e0c9512f345 schweikert gabriele http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute of language cognition and computation schweikert gabriele 2 schweikert, g. {16a03a18-38fa-4870-9999-b1a2ee85a2df, e91b0cb... [dgw, exploratori, data, analysi, tool, cluste... [(1, 1), (3, 2), (11, 1), (14, 1), (24, 5), (2... [0.176517855549, 0, 0, 0.0263686409299, 0.1207... False 11
ec061db1-98b7-4ad8-9ba3-257f390dbe34 heafield kenneth http://www.research.ed.ac.uk/portal/en/persons... lecturer in data science school of informatics UNKNOWN heafield kenneth 0 heafield, k. {ebe6301c-0085-4cf5-a085-eb054904e1a3, 404b745... [univers, edinburgh, neural, mt, system, wmt, ... [(4, 1), (15, 1), (40, 1), (43, 1), (44, 5), (... [0, 0, 0.052971735447, 0.0877102298695, 0, 0.0... False 16
ecd799fb-4f63-44ae-a078-b009099f2c8c alex beatrice http://www.research.ed.ac.uk/portal/en/persons... research fellow school of informatics institute of language cognition and computation alex beatrice 2 alex, b.|alex, b. (ed.) {e390989e-217c-4180-b20c-202e50c3af96, 4700ef0... [exploit, multipli, annot, corpora, biomed, in... [(3, 1), (8, 7), (11, 1), (14, 17), (23, 1), (... [0.0635860631745, 0, 0.0557625811584, 0, 0.046... False 16
ee01bccc-c4d2-45f8-a81c-9c634464a623 lapata mirella http://www.research.ed.ac.uk/portal/en/persons... personal chair in natural language processing school of informatics institute of language cognition and computation lapata mirella 2 lapata, m. {d2534162-b055-4fc8-b825-37c83fcbab2c, 0d85d86... [similaritydriven, semant, role, induct, via, ... [(3, 1), (6, 11), (8, 8), (11, 5), (17, 1), (1... [0, 0.0170127885081, 0.0166623626781, 0.021505... False 9
ef0fbd5f-acbb-433e-9cd7-4cfd8d3fc513 shimodaira hiroshi http://www.research.ed.ac.uk/portal/en/persons... lecturer school of informatics institute for adaptive and neural computation shimodaira hiroshi 4 shimodaira, h. {3adcaa40-810f-471a-b1cd-7e9456d5067c, cc12d9b... [univers, edinburgh, speaker, person, mocap, d... [(0, 1), (2, 6), (3, 1), (30, 7), (36, 1), (46... [0, 0, 0.0516818483678, 0.0398306903894, 0.030... False 15
efbfd31a-4fd3-450c-9761-7871d7026cf5 radu valentin http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics neuroinformatics dtc radu valentin 6 radu, v. {1b5f1666-c71c-40bf-b9f7-9d82db2fa897, e81054a... [pazl, mobil, crowdsens, base, indoor, wifi, m... [(45, 2), (46, 3), (98, 2), (131, 1), (133, 1)... [0.330625824176, 0, 0.0346304667593, 0.0414545... False 0
f0a3769d-bc61-42f4-b06a-8b90b20a44dc calautti marco http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics laboratory for foundations of computer science calautti marco 3 calautti, m. {6c065861-fba1-4bd8-abca-79dc41d35d77, 559845d... [chase, termin, guard, existenti, rule, chase,... [(6, 8), (15, 1), (45, 2), (69, 4), (72, 1), (... [0.0237619862575, 0.254387533826, 0.1891415933... False 18
f0e55345-3da8-451d-bd30-8b1068015efc koehn philipp http://www.research.ed.ac.uk/portal/en/persons... personal chair in machine translation school of informatics institute of perception action and behaviour koehn philipp 7 koehn, p. {d6453dbc-1004-4d55-944d-c9a90e063567, 0609fd5... [find, workshop, statist, machin, translat, pa... [(0, 4), (1, 1), (3, 6), (4, 2), (6, 2), (11, ... [0, 0, 0.0391016121553, 0, 0, 0.046304757153, ... False 16
f2d0a050-bb05-469c-a97c-007e1f744ba4 manataki areti http://www.research.ed.ac.uk/portal/en/persons... senior researcher school of informatics institute for computing systems architecture manataki areti 5 manataki, a. {1b97dc9d-169e-466b-90b9-0ae44c733453, 8007f7a... [design, social, machin, heart, manual, servic... [(14, 1), (15, 1), (16, 1), (22, 1), (23, 2), ... [0.121236425459, 0, 0.043510187575, 0.05126310... False 19
f598be85-b9b9-42b2-b09c-c70eafa6ee8d heil katharina http://www.research.ed.ac.uk/portal/en/persons... unknown neuroinformatics dtc institute for adaptive and neural computation heil katharina 4 heil, k. f.|heil, k. {ac5ee121-76f4-4b2d-a7b9-d070a132e99a, cf5f6e9... [system, biolog, approach, parkinson, diseas, ... [(11, 1), (17, 2), (40, 4), (49, 5), (66, 1), ... [0.145662283083, 0, 0, 0, 0.0140729699547, 0.0... False 11
f66a2291-496c-4371-a060-ef2d68c379f3 ricciotti wilmer http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics centre for intelligent systems and their appli... ricciotti wilmer 1 ricciotti, w. {a1f5b7b9-95f7-4c56-af67-49e2135f0fa9, e855ca6... [formal, result, chebyshev, number, theori, di... [(11, 1), (28, 2), (29, 1), (40, 2), (45, 1), ... [0, 0.297735654106, 0.133805551664, 0, 0, 0, 0... False 1
f804a9ce-5cc1-456b-b652-a91c8e034c68 birch alexandra http://www.research.ed.ac.uk/portal/en/persons... senior researcher school of informatics institute of language cognition and computation birch alexandra 2 birch-mayne, a.|birch, a. {d9c45a19-bce8-435e-8c93-7881e9fcbfd4, b3a721f... [combin, spoken, languag, translat, eu, bridge... [(3, 2), (6, 1), (9, 1), (15, 1), (29, 1), (30... [0, 0, 0.0183632864354, 0.0280137142341, 0, 0.... False 16
f87c8f49-7cb3-4d73-8133-8d3c98304c0b sinclair mark http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute for computing systems architecture sinclair mark 5 sinclair, m. {903f2107-febc-47c3-bb68-f6ab43f48af6, 0099f9d... [uedin, asr, system, iwslt, evalu, paper, desc... [(15, 1), (18, 1), (30, 2), (44, 1), (45, 1), ... [0.0103070866522, 0, 0.0540130292354, 0.023869... False 16
f8c8787b-838f-4485-a4ee-357b7406d75b crowley elliot http://www.research.ed.ac.uk/portal/en/persons... reserach associate school of informatics institute for adaptive and neural computation crowley elliot 4 crowley, e. j.|crowley, e. {1b1955cd-2686-4493-9800-51c71317323d, d6404a3... [face, paint, queri, art, photo, studi, proble... [(28, 1), (46, 1), (118, 1), (135, 7), (146, 3... [0, 0, 0.125164808025, 0, 0.506957893726, 0.18... False 4
fb2008c2-111d-48c0-aec8-3586828b2aef mcneill fiona http://www.research.ed.ac.uk/portal/en/persons... visitor official visitor school of informatics institute of language cognition and computation mcneill fiona 2 mcneill, f. {bcd52540-776d-4f56-9ae3-5ee1fa7c1af2, 1970ab4... [diagnos, repair, ontolog, mismatch, develop, ... [(11, 1), (14, 1), (17, 1), (25, 1), (40, 1), ... [0.023295220604, 0.0129520271327, 0.0169445770... False 14
fb3d0e26-d92f-434e-b79e-c2a712e1d328 penkov svetlin http://www.research.ed.ac.uk/portal/en/persons... unknown neuroinformatics dtc UNKNOWN penkov svetlin 0 penkov, s. {a492e783-0183-4493-aebb-b0df1a585efe, c514937... [ground, symbol, multimod, instruct, robot, be... [(52, 1), (67, 3), (109, 1), (114, 1), (118, 1... [0, 0, 0, 0.0460849472519, 0.0628845974812, 0,... False 9
fde737e9-3815-4539-9a22-1666d18eb4c7 series peggy http://www.research.ed.ac.uk/portal/en/persons... senior lecturer school of informatics institute for adaptive and neural computation series peggy 4 seriès, p.|seriés, p.|series, p. {d83f9106-2abf-4fed-8d13-7915704dbf4f, 1b0f653... [detect, quantifi, topograph, order, brain, to... [(6, 2), (11, 5), (16, 6), (21, 1), (23, 5), (... [0.0331270860386, 0, 0.0174760770029, 0, 0.082... False 17
fdf75867-e658-48aa-96cd-633b4e66abf3 fourman michael http://www.research.ed.ac.uk/portal/en/persons... professor school of informatics institute for computing systems architecture fourman michael 5 fourman, m.|fourman, m. p. {9e0feb3c-aae2-45c8-9eaf-4c6460e489d6, 2b334cc... [cad, tool, futur, algorithm, softwar, archite... [(14, 10), (15, 2), (22, 1), (29, 2), (35, 1),... [0.0450313010703, 0.344235504629, 0.0873065882... False 1

201 rows × 15 columns

In [191]:
lookup_poinf_more_drop.iloc[classDef==0]
Out[191]:
last_name first_name perseonal_url position parent institute full_name institute_class alias pub_ids summary_toks bow topic_distribution remove_drop best_topic
id
02c86de2-0fc9-4f6d-aee9-93b0f7557c84 franke bjoern http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics institute of language cognition and computation franke bjoern 2 franke, b. {9a3368cc-e69d-4ecf-bad1-b43ab0ac89a8, ab3fccd... [use, genet, program, sourcelevel, data, assig... [(1, 1), (3, 5), (6, 2), (10, 1), (11, 1), (13... [0.0467349416537, 0, 0.0134058642663, 0.685572... False 3
0b2fae7b-cf7f-4f8b-a92c-4fa055ff9d63 bhatotia pramod http://www.research.ed.ac.uk/portal/en/persons... senior lecturer in computing systems architecture school of informatics institute for computing systems architecture bhatotia pramod 5 bhatotia, p. {ea82758c-a623-40b4-9a11-1147103f2949, bedeb6e... [increment, mapreduc, comput, larg, scale, big... [(23, 1), (28, 1), (34, 1), (45, 4), (48, 1), ... [0.0244220477869, 0, 0, 0.398082836141, 0, 0, ... False 3
0d940898-e2a7-4262-bf06-5b146fb79ba2 spink tom http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute of language cognition and computation spink tom 2 spink, t. {ab3fccd9-365d-4757-8d2c-f62aaeb59791, e6a0f61... [hardwar, acceler, crossarchitectur, fullsyste... [(13, 3), (15, 2), (45, 1), (58, 1), (65, 3), ... [0.0507503450478, 0, 0, 0.631969091882, 0, 0, ... False 3
0ed800f5-a3a0-47d7-a8b3-f97a4f2b6931 steuwer michel http://www.research.ed.ac.uk/portal/en/persons... unknown institute for computing systems architecture laboratory for foundations of computer science steuwer michel 3 steuwer, m. {20cb2fdd-6d93-40b9-9cab-e9d818eb166e, b74a3be... [highlevel, program, medic, imag, multigpu, sy... [(5, 31), (6, 1), (11, 1), (15, 2), (28, 9), (... [0.0121417985093, 0, 0.0332301562306, 0.620844... False 3
12b8d4c5-226d-430f-8c55-554aa75fcda8 joshi arpit http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics laboratory for foundations of computer science joshi arpit 3 joshi, a. {aece86e1-ac7e-4bd9-8065-0e6386906749, fdc2a8d... [function, transcript, factor, target, discove... [(14, 1), (45, 2), (55, 3), (58, 2), (65, 1), ... [0.11343552675, 0, 0, 0.43129692923, 0.0342080... False 3
2026e8cc-5cb9-4397-b9a9-490ed120e2a5 grot boris http://www.research.ed.ac.uk/portal/en/persons... lecturer in informatics school of informatics institute of language cognition and computation grot boris 2 grot, b. {90a9d1e9-1fdc-47dd-8dc0-89dfca0a1828, e4706f7... [scaleout, processor, scale, datacent, mandat,... [(14, 1), (15, 3), (17, 1), (25, 3), (32, 2), ... [0.0844472337689, 0, 0, 0.563768078615, 0.0217... False 3
3d2352c5-8e16-4434-8802-67d74d0a4b36 viglas stratis http://www.research.ed.ac.uk/portal/en/persons... personal chair of data management on new hardware school of informatics laboratory for foundations of computer science viglas stratis 3 viglas, s. d.|viglas, s. {b047bfa3-0bb7-4686-82d1-08161610ab3f, 21a6def... [model, multithread, queri, execut, chip, mult... [(5, 2), (11, 2), (14, 4), (22, 1), (23, 1), (... [0.0258041796291, 0, 0.0348903432504, 0.340716... False 3
412b9b6c-a9d5-47d6-81af-018323057f36 bodin bruno http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute for computing systems architecture bodin bruno 5 bodin, b. {6b26c412-60a4-49de-b0de-56228e87e068, e45d745... [live, evalu, cyclostat, dataflow, graph, cycl... [(15, 1), (24, 2), (28, 4), (29, 1), (40, 1), ... [0.0738652190376, 0.0912953966609, 0, 0.401687... False 3
489d4278-a0a6-4e8b-857a-4ee0e800766f wagstaff harry http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute of language cognition and computation wagstaff harry 2 wagstaff, h. {ab3fccd9-365d-4757-8d2c-f62aaeb59791, 1f8be5b... [hardwar, acceler, crossarchitectur, fullsyste... [(13, 4), (15, 2), (45, 1), (58, 1), (65, 3), ... [0.0516633484039, 0, 0, 0.656913243918, 0.0201... False 3
68ca9564-52a9-46ce-92af-0480a43b555d cole murray http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics institute for computing systems architecture cole murray 5 cole, m.|cole, m. i.|cole, m. u. r. r. a. y. {d760e34b-9961-443b-9d39-477ff1f2f7ef, 273132e... [structur, approach, model, perform, system, u... [(3, 1), (5, 39), (6, 1), (11, 5), (15, 1), (2... [0.0416107383395, 0.0637323834977, 0.040842143... False 3
6dc0617d-43af-4c3a-88a6-aadf5bd8f57b smith aaron http://www.research.ed.ac.uk/portal/en/persons... reader in computing systems architecture school of informatics institute for computing systems architecture smith aaron 5 smith, a. {b2650c3e-425d-414a-9b7b-305d238bd694, ff72dc6... [machin, learn, approach, map, stream, workloa... [(13, 1), (14, 10), (17, 2), (23, 1), (28, 13)... [0.0691003822472, 0, 0, 0.567096757548, 0.0157... False 3
754637e1-0756-4d21-9afe-732b294e303f dubach christophe http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics institute for computing systems architecture dubach christophe 5 dubach, c. {b2650c3e-425d-414a-9b7b-305d238bd694, fa62108... [machin, learn, approach, map, stream, workloa... [(3, 2), (11, 4), (13, 3), (14, 1), (15, 2), (... [0.0535716362631, 0, 0, 0.691537160327, 0.0227... False 3
c18e1d0a-166d-4615-b7f8-4eb02e964656 leather hugh http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics centre for intelligent systems and their appli... leather hugh 1 leather, h. {0b8939ab-2a63-456a-b883-29c10cfc122c, b973d73... [optim, space, explor, fastflow, parallel, ske... [(5, 14), (11, 1), (14, 1), (16, 1), (23, 1), ... [0.0863808447963, 0.026313977147, 0.0232772770... False 3
d15a43ba-c1c3-46bd-af99-55483ee5d119 petoumenos pavlos http://www.research.ed.ac.uk/portal/en/persons... research associate school of informatics institute of perception action and behaviour petoumenos pavlos 7 petoumenos, p. {66350be6-d95a-424f-84e8-7243be29c53b, f5858ce... [alea, finegrain, energi, profil, tool, energi... [(3, 1), (5, 1), (11, 4), (14, 1), (15, 2), (1... [0.0947015754768, 0, 0.0381563617703, 0.563898... False 3
d8071e52-0b22-4b1a-b4aa-3b04ba9c75b7 topham nigel http://www.research.ed.ac.uk/portal/en/persons... chair of computer systems school of informatics institute of language cognition and computation topham nigel 2 topham, n. p.|topham, n.|topham, n. p. (ed.) {bef92040-cde7-4f42-9cbf-d62192de2820, a2a8229... [earli, resolv, instruct, techniqu, disclos, h... [(3, 3), (6, 1), (10, 1), (11, 3), (13, 3), (1... [0.0455292929915, 0.0157290767137, 0.028626599... False 3
e184b211-081f-450b-8931-b0471a0e0c29 sreekar shenoy govind http://www.research.ed.ac.uk/portal/en/persons... unknown institute for computing systems architecture institute for adaptive and neural computation sreekar shenoy govind 4 sreekar shenoy, g.|s., g.|shenoy, g. s. {5bf2d627-928e-4b29-a4d4-9ae189f8384a, 7579406... [exploit, tempor, local, network, traffic, use... [(11, 1), (15, 1), (141, 2), (155, 2), (156, 1... [0, 0, 0.0765566822572, 0.325480398647, 0.0572... False 3
eb085c28-d4a0-4d51-83e8-881a148e7fff kumar rakesh http://www.research.ed.ac.uk/portal/en/persons... unknown institute for computing systems architecture UNKNOWN kumar rakesh 0 kumar, r. {858f08e1-6a3b-45ed-a970-a41be052ac68, ff971c6... [assist, static, compil, vector, specul, dynam... [(15, 1), (17, 1), (24, 1), (26, 1), (28, 49),... [0.103331177705, 0, 0.0252408426641, 0.5689503... False 3
f53b1ad8-3f19-4e7e-8a3a-f7996de038c0 nagarajan vijay http://www.research.ed.ac.uk/portal/en/persons... reader school of informatics centre for intelligent systems and their appli... nagarajan vijay 1 nagarajan, v. {a603b6a9-d7bb-42e8-b343-7e307074618e, 464e8bb... [understand, effect, data, corrupt, applic, be... [(3, 1), (6, 5), (10, 1), (13, 2), (17, 2), (2... [0.0285011260705, 0.0124352771291, 0.025752382... False 3